Classifying and alleviating the communication overheads in matrix computations on large-scale NUMA multiprocessors
Authors
Abstract
Large-scale, shared-memory multiprocessors have non-uniform memory access (NUMA) costs. The high communication cost dominates the source of matrix computations' execution. Memory contention and remote memory access are two major communication overheads on large-scale NUMA multiprocessors. However, previous experiments and discussions focus either on reducing the number of remote memory accesses or on alleviating memory contention overhead. In this paper, we propose a simple but effective processor allocation policy, called rectangular processor allocation, to alleviate both overheads at the same time. The policy divides the matrix elements into a certain number of rectangular blocks, and assigns each processor to compute the results of one rectangular block. This methodology may reduce a lot of unnecessary memory accesses to the memory modules. After running many matrix computations under a realistic memory system simulator, we confirmed that at least one-fourth of the communication overhead may be reduced. Therefore, we conclude that rectangular processor allocation policy performs better than other popular policies, and that the combination of rectangular processor allocation policy with software interleaving data allocation policy is a better choice to alleviate communication overhead. © 1998 Elsevier Science Inc. All rights reserved.
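The block partitioning described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the block shape is chosen or how blocks map to memory modules, so the processor grid (`grid_rows` by `grid_cols`), the near-equal split, and all function names below are assumptions made for illustration only, not the paper's method. The matrix product is used as the example computation: a processor that owns an r-by-c block of the result only needs r rows of the left operand and c columns of the right operand.

```python
def split_range(n, parts):
    """Split range(n) into `parts` contiguous, near-equal half-open intervals."""
    base, extra = divmod(n, parts)
    bounds = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # first `extra` parts get one more
        bounds.append((start, start + size))
        start += size
    return bounds


def rectangular_allocation(n_rows, n_cols, grid_rows, grid_cols):
    """Map a hypothetical processor id to (row_start, row_end, col_start, col_end).

    Processors are assumed to form a grid_rows x grid_cols grid; this layout
    is an illustrative assumption, not taken from the paper.
    """
    row_bounds = split_range(n_rows, grid_rows)
    col_bounds = split_range(n_cols, grid_cols)
    blocks = {}
    for r, (r0, r1) in enumerate(row_bounds):
        for c, (c0, c1) in enumerate(col_bounds):
            blocks[r * grid_cols + c] = (r0, r1, c0, c1)
    return blocks


def multiply_block(a, b, block):
    """Compute the entries of C = A * B that fall inside one rectangular block."""
    r0, r1, c0, c1 = block
    inner = len(b)
    return {
        (i, j): sum(a[i][k] * b[k][j] for k in range(inner))
        for i in range(r0, r1)
        for j in range(c0, c1)
    }
```

For example, `rectangular_allocation(4, 4, 2, 2)` gives four processors one 2-by-2 block each, and `multiply_block` can then be called once per processor id with that processor's block.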
Related resources
Clustered affinity scheduling on large-scale NUMA multiprocessors
Modern shared-memory multiprocessors have high and non-uniform memory access (NUMA) costs. The communication cost gradually dominates the source of parallel applications’ execution. Algorithms based on affinity, such as the affinity scheduling algorithm (AFS), perform better than dynamic algorithms, such as guided self-scheduling (GSS) and trapezoid self-scheduling (TSS). However, as the number of proc...
Experiences with Data Distribution on NUMA Shared Memory Multiprocessors
The choice of a good data distribution scheme is critical to performance of data-parallel applications on both distributed memory multiprocessors and NUMA shared memory multiprocessors. The high cost of interprocessor communication in distributed memory multiprocessors makes the minimization of communications the predominant issue in selecting data distribution schemes. However, on NUMA multipro...
Hierarchical loop scheduling for clustered NUMA machines
Loop scheduling is an important issue in the development of high performance multiprocessors. As modern multiprocessors have high and non-uniform memory access (NUMA) costs, the communication costs dominate the execution of parallel programs. Previous affinity algorithms perform better than dynamic algorithms under non-clustered NUMA multiprocessors, but they suffer heavy overheads when migrating ...
Memory Latency Reduction with Fine-grain Migrating Threads in Numa Shared-memory Multiprocessors
In order to fully realize the potential performance benefits of large-scale NUMA shared memory multiprocessors, efficient techniques to reduce/tolerate long memory access latencies in such systems are to be developed. This paper discusses the concept, software and hardware support for memory latency reduction through fine-grain non-transparent thread migration, referred to as mobile multithread...
Multiprogrammed Parallel Application Scheduling in NUMA Multiprocessors
The invention, acceptance, and proliferation of multiprocessors are primarily a result of the quest to increase computer system performance. The most promising features of multiprocessors are their potential to solve problems faster than previously possible and to solve larger problems than previously possible. Large-scale multiprocessors offer the additional advantage of being able to execute ...
Journal title:
- Journal of Systems and Software
Volume 44, Issue
Pages -
Publication date 1998